
This page gives very detailed information about running programs on HPC. It assumes you have worked through our three part tutorial and compiled the parallel and serial hello world programs. If you are merely looking for a quick reference, check out our HPC Quickstart, which is targeted towards experienced users.


Starting Point

I assume that you have compiled your program; specifically, that you have created the executables described in this tutorial. The executables are assumed to sit in your current directory, which is the directory in which you want to run your code, where any input files are located, and where you wish to collect the output files (both the stdout and stderr captured by the scheduler and any other files that your code might create).

Overview of the Scheduler

All users are required to use the scheduler to run code on hpc. No other method of starting jobs, whether serial or parallel, is acceptable. Please take note of this, because it is the most important piece of 'good user behavior' on a shared system such as hpc.

A job, that is, an executable with its command-line arguments, is submitted to the scheduler with the qsub command. With qstat, you can see the status of the queue at any time. If you wish to delete a job for any reason, use the qdel command. There are additional commands, but these should get you started. They are explained in more detail below.

All scheduler commands have man pages. Also look under the "See Also" heading at the bottom of all man pages for cross-references to other pages.

The Scheduler Command qdel

I know this is out of order. But before this little piece of text gets buried under all the other remarks, let me point out that you can kill your own jobs. This applies both to removing a job from the queue before it runs and to killing it while it is running.
To delete a job from the queue or to kill a running job cleanly, use the qdel command with the job number of the job to be deleted, for instance "qdel 636". The job number can be obtained from the job listing produced by qstat (see below), or you might have noted it from the response of qsub when you originally submitted the job (also below). See man qdel for more information.
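For instance, the whole sequence of finding and deleting a job might look like the following sketch at the Linux prompt (636 is the hypothetical job number from above; use your own job's number):

# Hypothetical example; 636 is the job number used in the text above.
qstat          # find your job's number in the "Job id" column
qdel 636       # remove the job from the queue, or kill it if it is running
qstat          # the job should now be gone (or briefly show status E)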

Running serial code on hpc

In the directory in which you want to run your code, you need to create a script file that tells the scheduler the details of how to start the code, what resources you need, where to send output, and some other items. Let's call this file qsub-hello.p001 in this example:

#!/bin/bash

#PBS -N MPI_Hello
#PBS -o .
#PBS -e .
#PBS -W umask=007
#PBS -q low_priority
#PBS -l nodes=1

cd $PBS_O_WORKDIR

mpirun --machinefile $PBS_NODEFILE ./hello-world-c-gcc

The hello-world-c-gcc should be replaced by the name you gave your executable. This script is used as a command-line argument to the qsub command by saying

qsub qsub-hello.p001

at the Linux prompt, which results in

$ qsub qsub-hello.p001
1198.hpc.cl.rs.umbc.edu

Notice that qsub responds with the job number, here 1198, assigned to your job. Notice also that the response contains the cluster's internal hostname, which includes the "cl".
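If you want to use this job number later in your own shell commands or scripts, you can capture qsub's response in a variable; this is an optional convenience and not required by the scheduler:

# Optional: capture the scheduler's response (e.g. 1198.hpc.cl.rs.umbc.edu)
JOBID=$(qsub qsub-hello.p001)
echo "Submitted job $JOBID"
# Keep only the numeric job number by stripping everything after the first dot:
echo "Job number: ${JOBID%%.*}"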

You choose the name of your job with the option -N; this name will appear in the queue listing that you can see with qstat (see below). Choose a meaningful (but not too long) name for your job here.

The options -o and -e tell the scheduler in which directory to place the stdout and stderr files, respectively. At present, these files have the form jobnumber.hpc.cl.OU and jobnumber.hpc.cl.ER, respectively, since the jobnumber is a four-digit number; if it becomes a five-digit number, we will likely lose the letter "l" and get jobnumber.hpc.c.OU and jobnumber.hpc.c.ER. These files are created and accumulated in a temporary place and only moved to your directory after completion of the job.

The line -W umask=007 is important in the context of research groups, where all members of the group need to have permissions on the stdout and stderr files created by the scheduler. Without this line, only the user who ran the job has any permissions, and no other group member (not even the PI) can even read these files.
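As a quick illustration of what umask=007 means, here is a sketch you can try at the Linux prompt (the file name is hypothetical, for illustration only):

# With umask 007, newly created files come out as mode 660 (rw-rw----):
# the owner and the group can read and write them, but "other" users
# have no permissions at all.
umask 007
touch example-file        # hypothetical file name, for illustration only
ls -l example-file        # should show permissions of the form -rw-rw----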

The -q low_priority specifies which queue to submit your job to. The queue low_priority is the default on hpc.

The script qsub-hello.p001 is an example of a qsub script for serial code; therefore, we simply request 1 'node' with the option "-l nodes=1". The terminology of 'node' here is slightly misleading, because it probably dates from the days when nodes had only one (single-core) processor. More properly, in modern terminology, you are requesting the use of one core of one processor with this script. Notice that you are not requesting a whole processor, since it is possible for the scheduler to assign a second job to the same processor; that is, you will indeed only get one core with this request. Tests on our system have demonstrated that this is generally the most efficient way to use the resources of hpc; see the technical report HPCF-2008-1 on this webpage.

When the scheduler starts executing this script, its working directory is your home directory. But the environment variable PBS_O_WORKDIR holds on to the directory in which you started your job, which is typically not your home directory. To get back to this directory, the script first of all executes the line "cd $PBS_O_WORKDIR". From then on, you are again in the directory where this qsub script is located and where you issued the qsub command. Hence, we can access the executable in that directory as ./hello-world-c-gcc in the mpirun line. This directory change is crucial in particular if your code reads an input file and/or creates output files. Without the cd command, your executable would not be found; also, input files could not be accessed, and output files would all be put in your home directory.
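The following minimal sketch shows this directory change at the top of a qsub script; the error check is an optional addition and not part of the scripts above:

# Change to the directory from which qsub was run; abort if that fails,
# so the job does not accidentally run in your home directory.
cd "$PBS_O_WORKDIR" || exit 1
pwd    # should print the directory in which you issued the qsub command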

The line starting with mpirun actually starts the job. Its machinefile argument determines at run time which compute nodes are actually used for the parallel processes.

The Scheduler Command qstat

Once you have submitted your job to the scheduler, you will want to confirm that it has been entered into the queue. Use qstat at the command-line to get output similar to this:

$ qstat
Job id              Name             User            Time Use S Queue
------------------- ---------------- --------------- -------- - -----
1198.hpc            MPI_Hello        gobbert                0 Q low_priority

The most interesting column is the one titled S for "status". It shows what your job is doing at this point in time: The letter Q indicates that your job has been queued, that is, it is waiting for resources to become available and will then be executed. The letter R indicates that your job is currently running as in:

$ qstat
Job id              Name             User            Time Use S Queue
------------------- ---------------- --------------- -------- - -----
1198.hpc            MPI_Hello        gobbert                0 R low_priority

Finally, the letter E says that your job is exiting:

$ qstat
Job id              Name             User            Time Use S Queue
------------------- ---------------- --------------- -------- - -----
1198.hpc            MPI_Hello        gobbert         00:00:00 E low_priority

The status E appears during the shut-down phase, after the job has actually finished execution. See man qstat for more information. Clearly, the phases Q and E might be quite short for many jobs, so you would have to run qstat at just the right moment to see these statuses.
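If you want to watch the status change without retyping qstat, the standard watch utility (assuming it is available on the login node) can re-run it for you:

# Optional: re-run qstat every 5 seconds; press Ctrl-C to stop watching.
watch -n 5 qstat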

Running Parallel Code on HPC

For this example, it is assumed that you have the parallel executable hello_parallel in the current directory. Let's call the qsub script qsub-hello_parallel.p008 in this example:

#!/bin/bash

#PBS -N MPI_Hello
#PBS -o .
#PBS -e .
#PBS -W umask=007
#PBS -q low_priority
#PBS -l nodes=8

cd $PBS_O_WORKDIR

mpirun --machinefile $PBS_NODEFILE ./hello_parallel

Again, hello_parallel should be replaced by the name you gave your parallel executable. This script differs from the previous one for the serial ./hello-world-c-gcc in only two ways: the executable in the mpirun line is now the ./hello_parallel that you linked with OpenMPI, and we request 8 'nodes' with "-l nodes=8"; again, this terminology is misleading, because we are actually requesting 8 cores.

This script is used as a command-line argument to the qsub command by saying

qsub qsub-hello_parallel.p008

at the Linux prompt, which results in

$ qsub qsub-hello_parallel.p008
1199.hpc.cl.rs.umbc.edu

If you simply use qstat, you will see the same progression of the job through the statuses Q, R, and E for this job. But for parallel jobs, one can get a little more information by using "qstat -a", in particular the number of nodes used, despite the somewhat garbled appearance of the output if your window is not wide enough. To find out even more, namely exactly which nodes your job is running on, use "qstat -n", which implies the -a option. This might give you a result such as this:

$ qstat -n

hpc.cl.rs.umbc.edu:
                                                                   Req'd  Req'd   Elap
Job ID               Username Queue    Jobname    SessID NDS   TSK Memory Time  S Time
-------------------- -------- -------- ---------- ------ ----- --- ------ ----- - -----
1199.hpc.cl.rs.umbc. gobbert  low_prio MPI_Hello     --      2   1    --  16:00 R   --
   node032+node032+node032+node032+node031+node031+node031+node031

This shows that the scheduler assigned all 4 cores (both cores of each of the two dual-core processors) of node032 and node031 to the job. This is the default behavior of using the first available cores, starting with node032 and proceeding by counting down the node numbers to node001.

Continuing this example, the following directory listing shows the files that exist in the directory after the job has finished:

$ ls -l
total 76
-rw-rw---- 1 gobbert pi_gobbert     0 Jul 21 17:47 1199.hpc.cl.ER
-rw-rw---- 1 gobbert pi_gobbert   640 Jul 21 17:47 1199.hpc.cl.OU
-rwxrwx--- 1 gobbert pi_gobbert 44121 Jul 21 17:37 hello_parallel
-rw-rw---- 1 gobbert pi_gobbert   406 Jul 21 17:32 hello_parallel.c
-rw-rw---- 1 gobbert pi_gobbert   197 Jul 21 17:24 qsub-hello_parallel.p008

Here, hello_parallel.c is the source code and qsub-hello_parallel.p008 the qsub script, both printed above. The file hello_parallel is the executable obtained by compiling the code with mpicc. The file 1199.hpc.cl.ER is the stderr captured by the scheduler. The fact that it is empty (0 bytes long) confirms in this case (where my code does not ordinarily write to stderr) that the job finished without error. The file 1199.hpc.cl.OU is the stdout, which contains the output of the printf commands from each parallel process. Displaying this file with the more utility shows its contents:

$ more 1199.hpc.cl.OU
hello_parallel.c: Number of tasks=8 My rank=0 My name="node032.cl.rs.umbc.edu".
hello_parallel.c: Number of tasks=8 My rank=2 My name="node032.cl.rs.umbc.edu".
hello_parallel.c: Number of tasks=8 My rank=4 My name="node031.cl.rs.umbc.edu".
hello_parallel.c: Number of tasks=8 My rank=3 My name="node032.cl.rs.umbc.edu".
hello_parallel.c: Number of tasks=8 My rank=5 My name="node031.cl.rs.umbc.edu".
hello_parallel.c: Number of tasks=8 My rank=1 My name="node032.cl.rs.umbc.edu".
hello_parallel.c: Number of tasks=8 My rank=6 My name="node031.cl.rs.umbc.edu".
hello_parallel.c: Number of tasks=8 My rank=7 My name="node031.cl.rs.umbc.edu".

We see that each of the 8 parallel processes produced output, but not in any particular order. We can also confirm the compute nodes on which the code was run; notice that these hostnames, obtained by the code itself using MPI_Get_processor_name, agree with the nodes listed by the qstat -n command above. More precisely, we learn that "qstat -n" lists the names of the nodes in the order of the MPI process ranks, that is, in this example MPI processes 0, 1, 2, and 3 are on node032, which is listed first by qstat -n.
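If you want a quick tally of how many processes ran on each node, the standard text tools can count the hostnames in the stdout file; this is just a sketch using the file name from the listing above:

# Count how many lines of output came from each compute node.
grep -o 'node[0-9]*\.cl\.rs\.umbc\.edu' 1199.hpc.cl.OU | sort | uniq -c
# Expected result for this example: 4 lines from node031 and 4 from node032.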

Environment Variables in MPI Processes

UNIX uses a mechanism called environment variables to find critical programs and libraries, as well as other information. Your PATH variable is one example of this, but there are many other variables as well. You can see the environment variables currently set in your shell by running env, which might print something like:

TERM=xterm
SHELL=/bin/bash
SSH_TTY=/dev/pts/0
USER=samtrahan
SHLVL=1
HOME=/home/samtrahan
LOGNAME=samtrahan
PATH=/usr/bin:/usr/local/bin:/bin
LD_LIBRARY_PATH=/usr/local/lib

Your shell isn't the only place where environment variables are used – all programs in UNIX have environment variables. Two of these variables are critical: PATH and LD_LIBRARY_PATH. PATH tells your programs where to find any other programs that they need to run. LD_LIBRARY_PATH tells them where to find dynamic libraries – chunks of compiled code that are shared between several programs. Those two environment variables must also be set correctly in order for your job to find the mpirun program and its libraries. For example, with gcc-openmpi-1.2.6, /usr/mpi/gcc/openmpi-1.2.6/bin must be in your PATH variable and /usr/mpi/gcc/openmpi-1.2.6/lib64 must be in your LD_LIBRARY_PATH variable.
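If you ever need to add these directories to your environment yourself (for instance in your ~/.bashrc), the lines would look like the following sketch; the paths are the gcc-openmpi-1.2.6 directories named above and will differ for other compiler/MPI combinations:

# Prepend the OpenMPI (gcc) directories mentioned above; adjust the paths
# if your MPI installation lives elsewhere.
export PATH=/usr/mpi/gcc/openmpi-1.2.6/bin:$PATH
export LD_LIBRARY_PATH=/usr/mpi/gcc/openmpi-1.2.6/lib64:$LD_LIBRARY_PATH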

When mpirun starts an MPI program on several machines, it runs your program once for each process you requested (the -np option). Each time it starts your program, it creates what is known as a process. A process is merely a copy of your program that is currently running on a machine.

Those new processes have to have their PATH and LD_LIBRARY_PATH variables set correctly or they won't be able to find critical libraries and programs. MVAPICH copies the environment that your qsub script has, so there is no problem when using MVAPICH. Unfortunately, MVAPICH2 and OpenMPI give your new processes whatever environment variables your new shells on hpc.rs.umbc.edu start with. Thus jobs launched by OpenMPI often will not work if they need to access programs through the PATH variable or dynamic libraries through the LD_LIBRARY_PATH variable.

You can fix this problem by telling OpenMPI or MVAPICH2 to forward specific variables to your new processes. For OpenMPI, you must use the -x option to OpenMPI's mpirun command:

# This command will run eight processes and automatically forward your
# PATH and LD_LIBRARY_PATH variables to the new processes if you are using
# OpenMPI:
mpirun -np 8 -machinefile $PBS_NODEFILE -x PATH -x LD_LIBRARY_PATH ./hello_parallel

That will cause the PATH and LD_LIBRARY_PATH variables to be forwarded to the new processes that mpirun creates. Of course, the -x flag works with other environment variables as well, if you ever come upon the need to use other variables.
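For instance, to forward one additional variable alongside PATH and LD_LIBRARY_PATH, simply add another -x option (MY_SHINY_VARIABLE is a hypothetical example name, as in the MVAPICH2 example below):

# Each variable to be forwarded gets its own -x option with OpenMPI's mpirun:
mpirun -np 8 -machinefile $PBS_NODEFILE -x PATH -x LD_LIBRARY_PATH -x MY_SHINY_VARIABLE ./hello_parallel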

MVAPICH2 works slightly differently:

# This command will run eight processes and automatically forward your
# PATH and LD_LIBRARY_PATH variables to the new processes if you are using
# MVAPICH2.  Note that you still have to run mpdboot and mpdallexit.
mpirun_rsh -np 8 -hostfile $PBS_NODEFILE PATH="$PATH" LD_LIBRARY_PATH="$LD_LIBRARY_PATH" ./hello_parallel

If you need to forward other environment variables, you can just add them to that list (MY_SHINY_VARIABLE="$MY_SHINY_VARIABLE"). Just make sure you put them after the last mpirun_rsh option (things like -np 8 and -hostfile $PBS_NODEFILE) and before your program's name (./hello_parallel).
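Putting that together, the command with one additional (hypothetical) variable would look like this sketch:

# MY_SHINY_VARIABLE is the hypothetical example variable mentioned above;
# note that it appears after the mpirun_rsh options and before the program name.
mpirun_rsh -np 8 -hostfile $PBS_NODEFILE PATH="$PATH" LD_LIBRARY_PATH="$LD_LIBRARY_PATH" MY_SHINY_VARIABLE="$MY_SHINY_VARIABLE" ./hello_parallel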

Using Less than Four Processors Per Machine

When requesting several cores by, say, "-l nodes=8", the default behavior of the scheduler is to use the first available cores starting with node032. If one or more cores of node032 (but not all of them) are already in use, these first available cores would be the unused ones on node032. This is appropriate behavior for serial code that uses only one core. But since we want to use 8 cores, we might actually want our code to be concentrated on 2 nodes dedicated to us, using all 4 cores of each. This can be accomplished by the line "-l nodes=2:ppn=4", which requests a reservation for 4 'processors per node' ("ppn") on 2 nodes. Recall that each compute node on hpc has two dual-core processors, so up to 4 processors in this sense are available; this abbreviation and terminology pre-date the appearance of multi-core processors, so in today's terms it should rather say 4 cores per node. This request leads to the scheduler reserving these resources for you. By default, as in this example, mpirun executes as many parallel MPI processes as cores reserved by the scheduler, that is, 8 in this example. Tests on our system have demonstrated that this is generally the most efficient way to use the resources; see the technical report HPCF-2008-1 on this webpage.
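As a sketch, only the resource request line of the earlier qsub script qsub-hello_parallel.p008 changes; everything else, including the mpirun line, stays the same:

#PBS -l nodes=2:ppn=4

# The rest of the script is unchanged; with this request, mpirun still
# starts one MPI process per reserved core, i.e. 8 processes in this example.
cd $PBS_O_WORKDIR
mpirun --machinefile $PBS_NODEFILE ./hello_parallel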

Four Cores Per Machine: Efficiency Concerns

While the use of all available cores on hpc has proven efficient and effective, it is sometimes desirable to restrict the use of cores to fewer than the ones reserved for a job. For instance, to use only one core of each of the two processors on a node (i.e., to run only 2 parallel processes per node), still reserve all cores of all desired nodes by the same "-l nodes=2:ppn=4" as in the previous item, but modify the mpirun command to use four processes (more on that below). This leads to only two parallel processes running on each of the two requested nodes, hence a total of four parallel processes. The fact that we reserved all 4 cores on both nodes for our job means that the remaining cores cannot be given to any other job and hence are idling. This protects our job from being slowed down by any other user's job. Tests might indicate that such a 4-process job on 2 nodes is faster than a 4-process job on 1 node. However, since you have 2 nodes (with all 8 cores) reserved for you, the proper comparison is whether an 8-process job using all cores on both nodes is not still faster than this 4-process job on 2 nodes. This has proven to be true on hpc, and hence the general recommendation is to use all available cores reserved for you; see the technical report HPCF-2008-1 on this webpage. Look at "mpirun -h" for more information; the man page obtained by "man mpirun" does not appear to have this information.

Modifying your QSub Script to use Less than Four Processors per Node

No matter which MPI implementation you use, you must still use ppn=4 in your qsub script, or you will either end up with fewer machines than you want or you may be sharing machines with other users (or both). With MVAPICH2, you might run into a variety of more interesting and very bad problems if you don't use ppn=4. The modifications you must make to your mpirun or mpiexec command depend on which MPI implementation you are using:

OpenMPI

First, if you want to run fewer than four processes per node, you must still request ppn=4:

#PBS -l nodes=5:ppn=4

You must also change your mpirun command:

mpirun -npernode 2 -machinefile $PBS_NODEFILE ./hello_parallel

The 2 should be replaced with the number of processes you wish to start on each machine, and ./hello_parallel should be replaced with your program's name and arguments. Note that we're using -npernode 2 instead of -np 10. That tells OpenMPI to start two processes on each machine in your machinefile. Without it, OpenMPI might start four processes on the first machine, four on the second and two on the third (or it might not, depending on how OpenMPI is configured at the time). Adding -npernode 2 ensures that OpenMPI will run exactly two processes per node. Of course, you can change that 2 to a 1, 3 or 4 to get a different number of processes per node.

MVAPICH

Unfortunately, with MVAPICH, running fewer than four processes per machine isn't as simple as with OpenMPI. Let's say you create this qsub script, intending to run two processes per machine on five machines:

#!/bin/bash

#PBS -N MPI_Hello
#PBS -o .
#PBS -e .
#PBS -W umask=007
#PBS -q low_priority
#PBS -l nodes=5:ppn=4

cd $PBS_O_WORKDIR

# This script will NOT do what you want it to do.
# We must first create a new machinefile.  Read below.

mpirun -np 10 -machinefile $PBS_NODEFILE ./hello_parallel

You would expect that script to start two processes on each of the five machines you have allocated. Unfortunately, that is not what it will do with the MVAPICH implementations: they will start four processes on each of the first two machines and two on the third. The reason for this is the contents of the $PBS_NODEFILE:

node032.cl.rs.umbc.edu
node032.cl.rs.umbc.edu
node032.cl.rs.umbc.edu
node032.cl.rs.umbc.edu
node031.cl.rs.umbc.edu
node031.cl.rs.umbc.edu
node031.cl.rs.umbc.edu
node031.cl.rs.umbc.edu
node030.cl.rs.umbc.edu
node030.cl.rs.umbc.edu
node030.cl.rs.umbc.edu
node030.cl.rs.umbc.edu
node029.cl.rs.umbc.edu
node029.cl.rs.umbc.edu
node029.cl.rs.umbc.edu
node029.cl.rs.umbc.edu
node028.cl.rs.umbc.edu
node028.cl.rs.umbc.edu
node028.cl.rs.umbc.edu
node028.cl.rs.umbc.edu

When MVAPICH's mpirun sees that file, it assumes that the first ten processes should start on the nodes specified on the first ten lines of the machine file. Hence four of your processes will start on node032.cl.rs.umbc.edu, four on node031 and two on node030. To fix this, you have to create a new machine file with the lines reordered. Let's modify that qsub script to fix the problem:

#!/bin/bash

#PBS -N MPI_Hello
#PBS -o .
#PBS -e .
#PBS -W umask=007
#PBS -q low_priority
#PBS -l nodes=5:ppn=4

: You MUST use ppn=4 if you want to run less than four processes
: per node.

cd $PBS_O_WORKDIR

# Create a file in /tmp with a randomly-generated name:
export CORRECTED_NODEFILE=`mktemp`

# Print two of each machine to the corrected node file
# Replace the 2 below with the number of processes you
# wish to run per node:
uniq $PBS_NODEFILE | perl -ne 'for($i=0;$i<2;$i++) { print }' > "$CORRECTED_NODEFILE"

# Run mpirun with the corrected node file.  Note that the -np
# option should be followed by the total number of processes that
# you wish to run (2 processes per machines * 5 machines = 10
# processes in this case).
mpirun -np 10 -machinefile "$CORRECTED_NODEFILE" ./hello_parallel

The line export CORRECTED_NODEFILE=`mktemp` creates a new file in /tmp and stores its name in the CORRECTED_NODEFILE variable. The uniq $PBS_NODEFILE command prints each machine name only once, collapsing the adjacent duplicate lines in the nodefile. The output from uniq is piped into the perl -ne ... command, which prints each line twice (due to the $i<2). Then we redirect the output of the perl command to the temporary file created by mktemp. The resulting machine file (whose name is in $CORRECTED_NODEFILE) will contain:

node032.cl.rs.umbc.edu
node032.cl.rs.umbc.edu
node031.cl.rs.umbc.edu
node031.cl.rs.umbc.edu
node030.cl.rs.umbc.edu
node030.cl.rs.umbc.edu
node029.cl.rs.umbc.edu
node029.cl.rs.umbc.edu
node028.cl.rs.umbc.edu
node028.cl.rs.umbc.edu

Since the first ten entries are now the first ten machines, MVAPICH's mpirun will now execute two processes per node, which is exactly what you want.
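Since mktemp creates the temporary machine file in /tmp, you may optionally remove it at the end of the script once mpirun has returned; this line is not in the original script:

# Optional clean-up: remove the temporary machine file created by mktemp.
rm -f "$CORRECTED_NODEFILE"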

MVAPICH2

To use fewer than four processes per machine with MVAPICH2, you must create a modified machinefile. The process for creating the new machinefile is the same as for MVAPICH (described above), so you should read that section for an explanation of how this fix works. However, you must remember to use MVAPICH2's mpirun_rsh command instead. This sample script should work:

#!/bin/bash
: The above line tells Linux to use the shell /bin/bash to execute
: this script.  That must be the first line in the script.

: You must have no lines beginning with # before these
: PBS lines other than the line with /bin/bash
#PBS -N 'hello_parallel'
#PBS -o 'qsub.out'
#PBS -e 'qsub.err'
#PBS -W umask=007
#PBS -q low_priority
#PBS -l nodes=5:ppn=4
#PBS -m bea

: Remember to use ppn=4.

: Change our current working directory to the directory from which you ran qsub:
cd $PBS_O_WORKDIR

# Create a file in /tmp with a randomly-generated name:
export CORRECTED_NODEFILE=`mktemp`

# Print two of each machine to the corrected node file
# Replace the 2 below with the number of processes you
# wish to run per node:
uniq $PBS_NODEFILE | perl -ne 'for($i=0;$i<2;$i++) { print }' > "$CORRECTED_NODEFILE"

# Execute hello_parallel using mpirun_rsh.  The 10 below should be replaced
# with the number of nodes times the number of processes you wish to run
# per node.  (In this case: 5 nodes * 2 processes/node = 10 processes.)

mpirun_rsh -hostfile "$CORRECTED_NODEFILE" -np 10 ./hello_parallel